{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "m0jRtbRGsoZy"
   },
   "source": [
    "#  Using Reinforcement Learning to Play Pong\n",
    "\n",
    "This tutorial demonstrates using reinforcement learning to train an agent to play Pong.  This task isn't directly related to chemistry, but video games make an excellent demonstration of reinforcement learning techniques.\n",
    "\n",
    "![title](assets/pong.png)\n",
    "\n",
    "## Colab\n",
    "\n",
    "This tutorial and the rest in this sequence can be done in Google Colab (although the visualization at the end doesn't work correctly on Colab, so you might prefer to run this tutorial locally). If you'd like to open this notebook in colab, you can use the following link.\n",
    "\n",
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deepchem/deepchem/blob/master/examples/tutorials/Using_Reinforcement_Learning_to_Play_Pong.ipynb)\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 188
    },
    "colab_type": "code",
    "id": "-1kpETs2GnbI",
    "outputId": "dc8d5ae6-a0d7-4236-8168-8b615806ce41"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Requirement already satisfied: deepchem in c:\\users\\hp\\deepchem_2 (2.8.1.dev20240501183346)\n",
      "Requirement already satisfied: joblib in c:\\users\\hp\\anaconda3\\envs\\deep\\lib\\site-packages (from deepchem) (1.3.2)\n",
      "Requirement already satisfied: numpy>=1.21 in c:\\users\\hp\\anaconda3\\envs\\deep\\lib\\site-packages (from deepchem) (1.26.4)\n",
      "Requirement already satisfied: pandas in c:\\users\\hp\\anaconda3\\envs\\deep\\lib\\site-packages\\pandas-2.2.1-py3.10-win-amd64.egg (from deepchem) (2.2.1)\n",
      "Requirement already satisfied: scikit-learn in c:\\users\\hp\\anaconda3\\envs\\deep\\lib\\site-packages (from deepchem) (1.4.1.post1)\n",
      "Requirement already satisfied: sympy in c:\\users\\hp\\anaconda3\\envs\\deep\\lib\\site-packages (from deepchem) (1.12)\n",
      "Requirement already satisfied: scipy>=1.10.1 in c:\\users\\hp\\anaconda3\\envs\\deep\\lib\\site-packages (from deepchem) (1.12.0)\n",
      "Requirement already satisfied: rdkit in c:\\users\\hp\\anaconda3\\envs\\deep\\lib\\site-packages\\rdkit-2023.9.5-py3.10-win-amd64.egg (from deepchem) (2023.9.5)\n",
      "Requirement already satisfied: python-dateutil>=2.8.2 in c:\\users\\hp\\anaconda3\\envs\\deep\\lib\\site-packages (from pandas->deepchem) (2.8.2)\n",
      "Requirement already satisfied: pytz>=2020.1 in c:\\users\\hp\\anaconda3\\envs\\deep\\lib\\site-packages\\pytz-2024.1-py3.10.egg (from pandas->deepchem) (2024.1)\n",
      "Requirement already satisfied: tzdata>=2022.7 in c:\\users\\hp\\anaconda3\\envs\\deep\\lib\\site-packages\\tzdata-2024.1-py3.10.egg (from pandas->deepchem) (2024.1)\n",
      "Requirement already satisfied: Pillow in c:\\users\\hp\\anaconda3\\envs\\deep\\lib\\site-packages (from rdkit->deepchem) (10.2.0)\n",
      "Requirement already satisfied: threadpoolctl>=2.0.0 in c:\\users\\hp\\anaconda3\\envs\\deep\\lib\\site-packages (from scikit-learn->deepchem) (3.3.0)\n",
      "Requirement already satisfied: mpmath>=0.19 in c:\\users\\hp\\anaconda3\\envs\\deep\\lib\\site-packages (from sympy->deepchem) (1.3.0)\n",
      "Requirement already satisfied: six>=1.5 in c:\\users\\hp\\anaconda3\\envs\\deep\\lib\\site-packages (from python-dateutil>=2.8.2->pandas->deepchem) (1.16.0)\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "No normalization for SPS. Feature removed!\n",
      "No normalization for AvgIpc. Feature removed!\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "WARNING:tensorflow:From c:\\Users\\HP\\anaconda3\\envs\\deep\\lib\\site-packages\\keras\\src\\losses.py:2976: The name tf.losses.sparse_softmax_cross_entropy is deprecated. Please use tf.compat.v1.losses.sparse_softmax_cross_entropy instead.\n",
      "\n",
      "WARNING:tensorflow:From c:\\Users\\HP\\anaconda3\\envs\\deep\\lib\\site-packages\\tensorflow\\python\\util\\deprecation.py:588: calling function (from tensorflow.python.eager.polymorphic_function.polymorphic_function) with experimental_relax_shapes is deprecated and will be removed in a future version.\n",
      "Instructions for updating:\n",
      "experimental_relax_shapes is deprecated, use reduce_retracing instead\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Skipped loading modules with pytorch-geometric dependency, missing a dependency. No module named 'dgl'\n",
      "Skipped loading modules with transformers dependency. No module named 'transformers'\n",
      "cannot import name 'HuggingFaceModel' from 'deepchem.models.torch_models' (c:\\users\\hp\\deepchem_2\\deepchem\\models\\torch_models\\__init__.py)\n",
      "Skipped loading modules with pytorch-lightning dependency, missing a dependency. No module named 'lightning'\n",
      "Skipped loading some Jax models, missing a dependency. No module named 'jax'\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "'2.8.1.dev'"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "!pip install --pre deepchem\n",
    "import deepchem\n",
    "deepchem.__version__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 187
    },
    "colab_type": "code",
    "id": "9sv6kX_VsoZ1",
    "outputId": "ce4206d5-7917-4cad-c716-238a41f78e2a"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Requirement already satisfied: gym[accept-rom-license,atari] in c:\\users\\hp\\anaconda3\\envs\\deep\\lib\\site-packages (0.26.2)\n",
      "Requirement already satisfied: numpy>=1.18.0 in c:\\users\\hp\\anaconda3\\envs\\deep\\lib\\site-packages (from gym[accept-rom-license,atari]) (1.26.4)\n",
      "Requirement already satisfied: cloudpickle>=1.2.0 in c:\\users\\hp\\anaconda3\\envs\\deep\\lib\\site-packages (from gym[accept-rom-license,atari]) (3.0.0)\n",
      "Requirement already satisfied: gym-notices>=0.0.4 in c:\\users\\hp\\anaconda3\\envs\\deep\\lib\\site-packages (from gym[accept-rom-license,atari]) (0.0.8)\n",
      "Requirement already satisfied: ale-py~=0.8.0 in c:\\users\\hp\\anaconda3\\envs\\deep\\lib\\site-packages (from gym[accept-rom-license,atari]) (0.8.1)\n",
      "Requirement already satisfied: autorom~=0.4.2 in c:\\users\\hp\\anaconda3\\envs\\deep\\lib\\site-packages (from autorom[accept-rom-license]~=0.4.2; extra == \"accept-rom-license\"->gym[accept-rom-license,atari]) (0.4.2)\n",
      "Requirement already satisfied: importlib-resources in c:\\users\\hp\\anaconda3\\envs\\deep\\lib\\site-packages (from ale-py~=0.8.0->gym[accept-rom-license,atari]) (6.4.0)\n",
      "Requirement already satisfied: typing-extensions in c:\\users\\hp\\anaconda3\\envs\\deep\\lib\\site-packages (from ale-py~=0.8.0->gym[accept-rom-license,atari]) (4.9.0)\n",
      "Requirement already satisfied: click in c:\\users\\hp\\anaconda3\\envs\\deep\\lib\\site-packages (from autorom~=0.4.2->autorom[accept-rom-license]~=0.4.2; extra == \"accept-rom-license\"->gym[accept-rom-license,atari]) (8.1.7)\n",
      "Requirement already satisfied: requests in c:\\users\\hp\\anaconda3\\envs\\deep\\lib\\site-packages (from autorom~=0.4.2->autorom[accept-rom-license]~=0.4.2; extra == \"accept-rom-license\"->gym[accept-rom-license,atari]) (2.31.0)\n",
      "Requirement already satisfied: tqdm in c:\\users\\hp\\anaconda3\\envs\\deep\\lib\\site-packages (from autorom~=0.4.2->autorom[accept-rom-license]~=0.4.2; extra == \"accept-rom-license\"->gym[accept-rom-license,atari]) (4.66.2)\n",
      "Requirement already satisfied: AutoROM.accept-rom-license in c:\\users\\hp\\anaconda3\\envs\\deep\\lib\\site-packages (from autorom[accept-rom-license]~=0.4.2; extra == \"accept-rom-license\"->gym[accept-rom-license,atari]) (0.6.1)\n",
      "Requirement already satisfied: colorama in c:\\users\\hp\\anaconda3\\envs\\deep\\lib\\site-packages (from click->autorom~=0.4.2->autorom[accept-rom-license]~=0.4.2; extra == \"accept-rom-license\"->gym[accept-rom-license,atari]) (0.4.6)\n",
      "Requirement already satisfied: charset-normalizer<4,>=2 in c:\\users\\hp\\anaconda3\\envs\\deep\\lib\\site-packages (from requests->autorom~=0.4.2->autorom[accept-rom-license]~=0.4.2; extra == \"accept-rom-license\"->gym[accept-rom-license,atari]) (3.3.2)\n",
      "Requirement already satisfied: idna<4,>=2.5 in c:\\users\\hp\\anaconda3\\envs\\deep\\lib\\site-packages (from requests->autorom~=0.4.2->autorom[accept-rom-license]~=0.4.2; extra == \"accept-rom-license\"->gym[accept-rom-license,atari]) (3.6)\n",
      "Requirement already satisfied: urllib3<3,>=1.21.1 in c:\\users\\hp\\anaconda3\\envs\\deep\\lib\\site-packages (from requests->autorom~=0.4.2->autorom[accept-rom-license]~=0.4.2; extra == \"accept-rom-license\"->gym[accept-rom-license,atari]) (2.2.1)\n",
      "Requirement already satisfied: certifi>=2017.4.17 in c:\\users\\hp\\appdata\\roaming\\python\\python310\\site-packages (from requests->autorom~=0.4.2->autorom[accept-rom-license]~=0.4.2; extra == \"accept-rom-license\"->gym[accept-rom-license,atari]) (2022.5.18.1)\n"
     ]
    }
   ],
   "source": [
    "!pip install \"gym[atari,accept-rom-license]\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reinforcement Learning\n",
    "\n",
    "Reinforcement learning involves an *agent* that interacts with an *environment*.  In this case, the environment is the video game and the agent is the player.  By trial and error, the agent learns a *policy* that it follows to perform some task (winning the game).  As it plays, it receives *rewards* that give it feedback on how well it is doing.  In this case, it receives a positive reward every time it scores a point and a negative reward every time the other player scores a point.\n",
    "\n",
    "The first step is to create an `Environment` that implements this task.  Fortunately,\n",
    "OpenAI Gym already provides an implementation of Pong (and many other tasks appropriate\n",
    "for reinforcement learning).  DeepChem's `GymEnvironment` class provides an easy way to\n",
    "use environments from OpenAI Gym.  We could just use it directly, but in this case we\n",
    "subclass it and preprocess the screen image a little bit to make learning easier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "EuRrb3vpsoZ_"
   },
   "outputs": [],
   "source": [
    "import deepchem as dc\n",
    "import numpy as np\n",
    "\n",
    "\n",
    "class PongEnv(dc.rl.GymEnvironment):\n",
    "  def __init__(self):\n",
    "    super(PongEnv, self).__init__('Pong-v4')\n",
    "    self._state_shape = (80, 80)\n",
    "\n",
    "  @property\n",
    "  def state(self):\n",
    "    # Crop everything outside the play area, reduce the image size,\n",
    "    # and convert it to black and white.\n",
    "    state_array = self._state\n",
    "    cropped = state_array[34:194, :, :]\n",
    "    reduced = cropped[0:-1:2, 0:-1:2]\n",
    "    grayscale = np.sum(reduced, axis=2)\n",
    "    bw = np.zeros(grayscale.shape, dtype=np.float32)\n",
    "    bw[grayscale != 233] = 1\n",
    "    return bw\n",
    "\n",
    "  def __deepcopy__(self, memo):\n",
    "    return PongEnv()\n",
    "\n",
    "env = PongEnv()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "GNnO3MZ_soaG"
   },
   "source": [
    "Next we create a model to implement our policy.  This model receives the current state of the environment (the pixels being displayed on the screen at this moment) as its input.  Given that input, it decides what action to perform.  In Pong there are three possible actions at any moment: move the paddle up, move it down, or leave it where it is.  The policy model produces a probability distribution over these actions.  It also produces a *value* output, which is interpreted as an estimate of how good the current state is.  This turns out to be important for efficient learning.\n",
    "\n",
    "The model begins with two convolutional layers to process the image.  That is followed by a dense (fully connected) layer to provide plenty of capacity for game logic.  We also add a small Gated Recurrent Unit (GRU).  That gives the network a little bit of memory, so it can keep track of which way the ball is moving.  Just from the screen image, you cannot tell whether the ball is moving to the left or to the right, so having memory is important.\n",
    "\n",
    "We concatenate the dense and GRU outputs together, and use them as inputs to two final layers that serve as the\n",
    "network's outputs.  One computes the action probabilities, and the other computes an estimate of the\n",
    "state value function.\n",
    "\n",
    "We also provide an input for the initial state of the GRU, and return its final state at the end.  This is required by the learning algorithm."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "BLdt8WAQsoaH"
   },
   "outputs": [],
   "source": [
    "import torch\n",
    "import torch.nn as nn\n",
    "import torch.nn.functional as F\n",
    "\n",
    "class PongPolicy(dc.rl.Policy):\n",
    "    def __init__(self):\n",
    "        super(PongPolicy, self).__init__(['action_prob', 'value', 'rnn_state'], [np.zeros(16, dtype=np.float32)])\n",
    "\n",
    "    def create_model(self, **kwargs):\n",
    "        class TestModel(nn.Module):\n",
    "            def __init__(self):\n",
    "                super(TestModel, self).__init__()\n",
    "                # Convolutional layers\n",
    "                self.conv1 = nn.Conv2d(1, 16, kernel_size=8, stride=4)\n",
    "                self.conv2 = nn.Conv2d(16, 32, kernel_size=4, stride=2)\n",
    "                self.fc1 = nn.Linear(2048, 256)\n",
    "                self.gru = nn.GRU(256, 16, batch_first = True)\n",
    "                self.action_prob = nn.Linear(272, env.n_actions)\n",
    "                self.value = nn.Linear(272, 1)\n",
    "            def forward(self, inputs):\n",
    "                state = (torch.from_numpy((inputs[0])))\n",
    "                rnn_state = (torch.from_numpy(inputs[1]))\n",
    "                reshaped = state.view(-1, 1, 80, 80)\n",
    "                conv1 = F.relu(self.conv1(reshaped))\n",
    "                conv2 = F.relu(self.conv2(conv1))\n",
    "                conv2 = conv2.view(conv2.size(0), -1)\n",
    "                x = F.relu(self.fc1(conv2))\n",
    "                reshaped_x = x.view(1, -1, 256)\n",
    "                #x = torch.flatten(x, 1)\n",
    "                gru_out, rnn_final_state = self.gru(reshaped_x, rnn_state.unsqueeze(0))\n",
    "                rnn_final_state = rnn_final_state.view(-1,16)\n",
    "                gru_out = gru_out.view(-1, 16)\n",
    "                concat = torch.cat((x, gru_out), dim=1)\n",
    "                #concat = concat.view(-1, 272)\n",
    "                action_prob = F.softmax(self.action_prob(concat), dim=-1)\n",
    "                value = self.value(concat)\n",
    "                return action_prob, value, rnn_final_state\n",
    "        return TestModel()\n",
    "policy = PongPolicy()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "YU19h0aUsoaN"
   },
   "source": [
    "We will optimize the policy using the Advantage Actor Critic (A2C) algorithm.  There are lots of hyperparameters we could specify at this point, but the default values for most of them work well on this problem.  The only one we need to customize is the learning rate."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "Fw_wu511soaO",
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "import torch.nn.functional as F\n",
    "from deepchem.rl.torch_rl.torch_a2c import A2C\n",
    "\n",
    "from deepchem.models.optimizers import Adam\n",
    "a2c = A2C(env, policy, model_dir='model', optimizer=Adam(learning_rate=0.0002))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "-PUD4JG2soaU"
   },
   "source": [
    "Optimize for as long as you have patience to.  By 1 million steps you should see clear signs of learning.  Around 3 million steps it should start to occasionally beat the game's built in AI.  By 7 million steps it should be winning almost every time.  Running on my laptop, training takes about 20 minutes for every million steps."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "Wa18EQlmsoaV"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "c:\\Users\\HP\\anaconda3\\envs\\deep\\lib\\site-packages\\gym\\utils\\passive_env_checker.py:233: DeprecationWarning: `np.bool8` is a deprecated alias for `np.bool_`.  (Deprecated NumPy 1.24)\n",
      "  if not isinstance(terminated, (bool, np.bool8)):\n"
     ]
    }
   ],
   "source": [
    "# Change this to train as many steps as you have patience for.\n",
    "a2c.fit(1000)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "_xHNjusSsoaa"
   },
   "source": [
    "Let's watch it play and see how it does! "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "Ud6DB_ndsoab"
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "c:\\Users\\HP\\anaconda3\\envs\\deep\\lib\\site-packages\\gym\\utils\\passive_env_checker.py:289: UserWarning: \u001b[33mWARN: No render fps was declared in the environment (env.metadata['render_fps'] is None or not defined), rendering may occur at inconsistent fps.\u001b[0m\n",
      "  logger.warn(\n"
     ]
    }
   ],
   "source": [
    "# This code doesn't work well on Colab\n",
    "env.reset()\n",
    "while not env.terminated:\n",
    "    env.env.render()\n",
    "    env.step(a2c.select_action(env.state))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "3MGK4nrhsoah"
   },
   "source": [
    "# Congratulations! Time to join the Community!\n",
    "\n",
    "Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the DeepChem community in the following ways:\n",
    "\n",
    "## Star DeepChem on [GitHub](https://github.com/deepchem/deepchem)\n",
    "This helps build awareness of the DeepChem project and the tools for open source drug discovery that we're trying to build.\n",
    "\n",
    "## Join the DeepChem Gitter\n",
    "The DeepChem [Gitter](https://gitter.im/deepchem/Lobby) hosts a number of scientists, developers, and enthusiasts interested in deep learning for the life sciences. Join the conversation!"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "name": "18_Using_Reinforcement_Learning_to_Play_Pong.ipynb",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
